Medical transcription with dynamic language models

ABSTRACT

A computer-implemented method for transcribing spoken input into text is disclosed. The method includes identifying a role of a healthcare provider using an automated transcription system, using the identified role to provide a language model for a speech recognition system operating on a computer system, receiving spoken input from the healthcare provider, and producing textual output corresponding to the spoken input using the provided language model.

TECHNICAL FIELD

This document relates to automatic transcription of medical information provided verbally by healthcare professionals.

BACKGROUND

Physicians and other healthcare professionals spend much of their time recording information about patients rather than actually caring for patients. Such information generally takes the form of observations that are then recorded as part of the patient's medical record so that later attending caregivers can be kept up-to-date on the patient's prior condition and care. Popular understanding envisions the medical record as a form in a clipboard that is kept in a pocket at the foot of a patient's bed or outside an exam room, and on which the caregiver scribbles with a pen before leaving the patient's room.

Electronic medical records (EMRs) have largely replaced traditional paper-based records in most healthcare systems. But healthcare workers are not particularly adept typists, nor do they want to have to carry a keyboard with them or move to a keyboard every time they have to enter data into an EMR. As a result, physicians frequently dictate their observations into small electronic tape or flash-based voice recorders and provide the tapes or files to trained transcriptionists for manual typing of the information. Such transcription can be costly, can create problems in ensuring that the proper information is joined to the proper patient's record, and includes inherent delays.

Thus, automated transcription has been used, with programs such as DRAGON NATURALLY SPEAKING and VIA VOICE providing particular healthcare-related modules. Still, such automated transcription can make errors, even after a healthcare provider has spent time “training” the system to learn his or her voice.

SUMMARY

This document describes systems and techniques that may be used for automated transcription of verbally-provided healthcare information. In certain examples discussed below, layered dynamic contextual language models, or grammars, are used with a speech-recognition system to improve the accuracy of transcription by such a system. The language models are dynamic because they can vary between individual healthcare providers, and not merely in ways that depend on the personal speech characteristics of the user, and can also vary from time to time as a particular healthcare provider operates the system. The models are contextual because they change based on things that the user is doing with a medical records system. For example, one model can be used when a physician is discussing drugs that have been prescribed to a patient, and another model can be used when the same physician is describing the current condition, such as vital signs, of a patient. The context can also depend on the patient themselves, where the language model can be trained on a copy of the particular patient's medical history, so that, for example, complicated medical terms or names of other caregivers that appear in the record can be elevated in importance in a language model, and can thus be recognized much more readily than if the medical record were not analyzed.

In certain implementations, such features may provide one or more advantages. For example, a speech recognition system may provide more accurate transcription, whether it is trained to a user's voice or not. The system may be implemented with existing voice recognition systems, by providing a mechanism for dynamically changing and supplementing models that may be accessed by such systems, but without a need to substantially reprogram the systems themselves. In addition, the features may be implemented relatively inexpensively with a general speech recognition engine, simply by providing particular context-specific language model elements. Moreover, the features may be transparent to the end user so that special training is not required (aside from potential voice training on the system, which is typical with more general speech recognition systems).

In one implementation, a computer-implemented method for transcribing spoken input into text is discussed. The method comprises identifying a role of a healthcare provider using an automated transcription system, using the identified role to provide a language model for a speech recognition system operating on a computer system, receiving spoken input from the healthcare provider, and producing textual output corresponding to the spoken input using the provided language model. Identifying the role of the healthcare provider can comprise receiving, by the computer system, a login from the healthcare provider, identifying the healthcare provider from the login, and associating a role with the identified healthcare provider. Also, using the identified role to provide the language model for the speech recognition system can comprise selecting a specialty-based language model that corresponds to a medical specialty for the role of the healthcare provider. The method can also include combining the selected specialty-based language model with a general language model to create the provided language model, and providing the textual output to an electronic medical record system. Moreover, the method can include providing with the textual output an identifier for a healthcare patient to which the healthcare provider's spoken input was directed.

In some aspects, the method further comprises identifying a patient for which the healthcare provider is providing spoken input, and generating language model data directed to information that corresponds to the patient. Generating the language model data can comprise analyzing prior medical record entries for the patient, and the method can also include using training data for the voice of the healthcare provider to provide the language model. Moreover, the method may comprise identifying information that corresponds to a category that the healthcare provider is providing spoken input to, and modifying the language model in response to identifying the information that corresponds to the category. And the category can comprise a field into which the healthcare provider is providing spoken input.

In yet another implementation, a computer-implemented system for transcribing spoken input into text is described, and comprises a voice input interface to receive spoken input from a healthcare provider, a language model builder in a computer system operable to identify a workplace role of the healthcare provider and to produce a language model using the identified workplace role, and a speech-to-text converter programmed to convert the received spoken input into textual output using the language model produced by the language model builder. The language model builder can be programmed to identify the role of the healthcare provider by receiving login information for the healthcare provider, identifying the healthcare provider from the login information, and associating a role with the identified healthcare provider, and to use the identified role to provide the language model for the speech recognition system by selecting a specialty-based language model that corresponds to a medical specialty for the role of the healthcare provider. Also, the language model builder can be programmed to combine the selected specialty-based language model with a general language model to create the produced language model.

In certain aspects, the speech-to-text converter is further programmed to provide the textual output to an electronic medical record system. Also, the system can be further programmed to provide with the textual output an identifier for a healthcare patient to which the healthcare provider's spoken input was directed. The language model builder can be programmed to identify a patient for which the healthcare provider is providing spoken input, and generate language model data directed to information that corresponds to the patient. Also, generating the language model data can comprise analyzing prior medical record entries for the patient, and the language model builder can be further programmed to use training data for the voice of the healthcare provider to produce the language model. Moreover, the speech-to-text converter can be further programmed to identify an information category into which the healthcare provider is providing spoken input, and to modify the language model in response to identifying the information that corresponds to the category. The category can correlate to a field into which the healthcare provider is providing spoken input.

In another implementation, a computer-implemented system for transcribing spoken input into text is disclosed and includes a voice input interface to receive spoken input from a healthcare provider; a speech-to-text converter programmed to convert the received spoken input into textual output using a language model; and means for creating the language model at least in part by identifying a workplace role of the healthcare provider. The means for creating the language model can be programmed to include user-specific data to create the language model in addition to the workplace role of the healthcare provider.

The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features and advantages will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF THE FIGURES

FIG. 1 shows a conceptual diagram of an automated medical transcription system that uses layered dynamic contextual language models.

FIG. 2 is a schematic diagram of a system for providing automated speech-recognition and medical transcription.

FIG. 3 is a flow chart of an automated transcription process.

FIG. 4 is a swim lane diagram of a process showing actions in an automated transcription process that may be carried out by the cooperation of multiple sub-systems.

FIG. 5 shows schematic diagrams of a general computer system and a general mobile computing device that may implement the techniques described in this document.

DETAILED DESCRIPTION

This document describes mechanisms by which a computer system may transcribe spoken input from a healthcare provider into text for inclusion in an electronic medical record (EMR) or similar document. The description discusses in particular the way that language models may be manipulated so as to provide more accurate speech recognition in such a setting. In one technique described here, various speech models, generally having different levels of specificity, can be layered onto each other so as to best match a present context in which a speech recognition system is being used. The system may start with a general language model that may represent the prevalence of spoken words in the English (or other) language, and a model directed to healthcare. The system may then add a speech model for a particular medical specialty for a user of the system, after determining the role of the user and thereby the specialty of the user. The system may also identify the patient so as to obtain an EMR for the patient, and thereby add words from the EMR to the language model. As the healthcare provider moves through an application, the system may further add to the language model, such as by identifying a record or field in which a cursor is located (i.e., a location or context in which text is being added to the system) and may load model data for that particular record or field.

FIG. 1 shows a conceptual diagram of an automated medical transcription system 100 that uses layered dynamic contextual language models. In general, the system 100 centers around a healthcare provider 102, here labeled as Dr. Lisa Smith, who is a cardiologist. Dr. Smith wishes to provide charting information and other medical record information to the system 100, and to interact with an EMR system such as by providing commands to navigate such a system.

The healthcare provider 102 in this example provides voice input via a microphone 106 connected to a computing device 104. The computing device 104 may have additional peripherals, such as a keyboard and mouse, which the healthcare provider 102 may use to interact with the device 104 in additional, complementary or alternative, manners. The device 104 may take a variety of forms, such as a desktop personal computer, a laptop, a netbook, a smart phone, or an iPad-like device. The device 104 may perform the speech-recognition described herein, or may provide the speech for processing by another device, such as by digitizing incoming speech as one or more speech files or streams, and providing the digitized sounds to a remote server system (including over the internet), where the processing may occur, such as occurs with GOOGLE VOICE SEARCH.

The system 100 in this example relies, among other things, on a variety of language models 110 that are available to the device 104 in order to produce an improved speech-to-text conversion of the healthcare provider's 102 spoken input. As shown here, the models 110 include, at the most general level, a general model 112. The general model may, in known manners, define data for resolving speech-provided input into text that normally occurs in the given language, which in this example is English. The general model 112 may include data that reflects the generalized or normalized frequency of occurrence of particular terms (words, parts of words, or phrases) in the language, and may weight each such term according to the frequency of occurrence. Thus, when the model is consulted, words having a higher frequency of occurrence may be provided as a more likely match to a sound that is under consideration for conversion by a speech-to-text converter of the system 100. The general model 112 may have been trained using various general sources, such as public documents (e.g., web pages), dictionaries, recorded speech with accompanying transcripts (e.g., from television closed-captioning) and the like.
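As one purely illustrative sketch of such frequency weighting, and not a required implementation, the following Python fragment builds a toy word-frequency table and uses it to choose between candidate terms. The function names, the sample corpus, and the candidate list are assumptions made for illustration only; an actual recognizer would also weigh acoustic evidence and surrounding words.

```python
from collections import Counter


def build_unigram_model(corpus_text):
    """Return the relative frequency of each lowercase word in corpus_text."""
    words = corpus_text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def pick_candidate(model, candidates):
    """Choose the candidate with the highest weight; unseen terms weigh 0."""
    return max(candidates, key=lambda term: model.get(term, 0.0))


# Hypothetical training text standing in for a large general corpus.
general_model = build_unigram_model(
    "the patient passed the test and the patient rested"
)
best = pick_candidate(general_model, ["patient", "patience"])
```

Here the term seen more often in the sample text is preferred over a similar-sounding term that was never seen.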

The general model 112 may also be supplemented with general data for healthcare providers. Such data may represent the frequency of usage in the healthcare community, and the data may be generated using medical-related document corpuses (e.g., from the Medline database and similar), medical dictation files and associated transcriptions, and the like. Thus, for example, the extra healthcare-related data may disambiguate spoken speech input so as to weight “forceps” higher than “force it”, when the latter may be weighted more highly outside the healthcare context with a more general language model.

Other language model data is provided by specialties models 114. The specialties models 114 each represent usage in a particular specialty. For example, cardiologists may have particular terms that they employ often, and that are seldom used by endocrinologists. One example may be “stent” over other terms that might be candidates, such as “spent” or “scent” or “cent.” Each of the models 114 may be individually accessed and applied by the system for a particular situation, as discussed below.

In addition, the models 110 may include contextual models 116, which may depend on a particular context that a particular healthcare provider 102 is in at a current moment. For example, the system 100 may know from the area in which a healthcare provider 102 is entering data, that the healthcare provider 102 is describing information about a family history. As a result, in such a situation, data for a family history data model may be added to a data model that the system 100 is presently employing to convert speech to text, so that terms relating to family history are emphasized for selection. For example, “past” may be elevated over “passed,” whereas the order may be the opposite in other contexts (e.g., when a gastroenterologist is providing a patient assessment).

These models may be combined into a finished language model, in this example, in the form of a dynamic layered language model 108 that is applied to spoken input provided by the healthcare provider 102. As shown, the language model 108 is a combination of the language models just discussed plus model data for the particular patient who is being treated (or more particularly, for whom the healthcare provider 102 is providing spoken input). In particular, terms may be extracted from an EMR for the patient whose record is currently being added to by the healthcare provider 102, and those terms may be added to a language model. Such additional data that is dynamically added to the language model may be particularly helpful in resolving uncommon words such as the names of other treating physicians for the patient.
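The extraction of terms from a patient's EMR may be sketched, under stated assumptions, as follows. The tokenizing pattern, the `boost` value, and the sample record text are hypothetical and are not drawn from any particular EMR system; the sketch only shows how terms that appear in the record may be elevated in a language model.

```python
import re
from collections import Counter


def extract_emr_terms(emr_text):
    """Count lowercase word tokens (letters, apostrophes, hyphens) in EMR text."""
    return Counter(re.findall(r"[a-z][a-z'\-]*", emr_text.lower()))


def add_patient_terms(model, emr_text, boost=5.0):
    """Return a copy of model with each EMR term raised by boost * its count."""
    updated = dict(model)
    for term, count in extract_emr_terms(emr_text).items():
        updated[term] = updated.get(term, 0.0) + boost * count
    return updated


# Hypothetical base weights: the common spelling starts out ahead.
base_model = {"smith": 2.0, "smyth": 0.1}
emr_text = "Referred by Dr. Smyth. Dr. Smyth placed a stent in 2009."
patient_model = add_patient_terms(base_model, emr_text)
```

In this sketch the uncommon physician name that appears in the record overtakes the more common spelling, while the base model is left unchanged for use with other patients.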

Thus, in one example of steps for the operation of system 100, healthcare provider 102 Dr. Smith may approach the device 104. The device 104 may already have a general language model established for interacting with it, and if no other relevant events occur, may use that model. The healthcare provider 102 may then log into the system, such as by swiping an ID card past a bar code reader or by swiping a finger past a fingertip reader. Such a gesture may cause information specific to the particular healthcare provider 102, Dr. Smith, to be gathered from a server system, such as a user ID and other information. One such piece of information may be a specialty in which the healthcare provider 102 practices, and in this case Dr. Smith is a cardiologist. It may also be relevant that Dr. Smith is a physician as opposed to a nurse. A role for the healthcare provider 102 may thus be defined and can include the healthcare provider's 102 title, their area of specialty, or a combination of the two.

The healthcare provider 102 may then identify the patient for whom they are providing or seeking information. Such an action may occur, for example, by scanning a bar code on a patient wristband, scanning a barcode on a paper record, locating a patient from a list, typing in a patient ID or name, or in other known manners. When the healthcare provider 102 logs in and provides a patient, the role for the healthcare provider 102 may be used to retrieve an appropriate specialty model 114, either from memory on the device 104 or at a remote server (which may be returned to the device 104 if the speech recognition is to occur at the device 104, or may be kept at the server otherwise). Also, the presence of a patient ID may be used to obtain text from the patient's EMR, and perhaps meta data for such text (e.g., the names of or labels on fields in the EMR). With the text retrieved, it may be analyzed and added to a speech-recognition model such as model 108.

The healthcare provider 102 may then begin providing data to the system 100 by speaking into microphone 106. As the speaker moves a cursor around an interface in the system 100, the system 100 may maintain information indicating the status of such interaction. For example, the healthcare provider 102 may move between different screens or forms in a record keeping application. Each such screen or form may be associated with a particular vocabulary, such as a list of words that are relevant to the form or screen. The list may be created manually, such as by an author of the form or screen, in an effort to identify all words that may be typed or spoken into the form or screen. The list may also be created by performing a textual analysis on the form or screen. For example, labels for fields or names of the fields may be identified in a familiar manner, and may be used to identify words that are associated with such a label. As a non-medical example, if the label for a field is “state,” the names of each of the 50 states may be used as a list of words.
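A minimal sketch of deriving a form vocabulary from field labels is given below. The label table and its word lists are hypothetical and deliberately truncated; in practice such lists might be supplied by the form author or produced by textual analysis, as described above.

```python
# Hypothetical label-to-vocabulary table (assumed for illustration only).
LABEL_VOCABULARIES = {
    "state": ["Alabama", "Alaska", "Arizona"],  # truncated list of state names
    "family history": ["mother", "father", "sibling", "past", "deceased"],
}


def vocabulary_for_form(field_labels):
    """Collect, without duplicates, the words tied to each known field label."""
    words = []
    for label in field_labels:
        for word in LABEL_VOCABULARIES.get(label.strip().lower(), []):
            if word not in words:
                words.append(word)
    return words


# Labels with no known vocabulary (here "Notes") simply contribute nothing.
form_words = vocabulary_for_form(["State", "Family History", "Notes"])
```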

Different weights may be given in an overall language model to the various sub-models in the model. For example, contextual information and EMR data may be given a highest weighting because such data is most directly connected to the action that a healthcare provider 102 is taking at a given moment. Data for the particular healthcare provider 102 specialty may be given a middle weighting because it is directly relevant to the normal terminology used by the particular healthcare provider 102, but it will tend to cover a relatively broad area, and thus not be as targeted as information about the particular patient and about the particular context in which the healthcare provider 102 is interacting with a user interface for the system 100. The general medical and general data may receive relatively lower weight, as it too is more general, and is not as likely to be helpful in resolving ambiguities by a recognition engine (e.g., it might not rank “blemish” higher than it ranks “Flemish”).
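One way such relative weighting could be realized is a simple weighted sum across sub-models, sketched below. The particular weights, the sub-model contents, and the function names are assumptions chosen only to illustrate the ordering described above (contextual and EMR data highest, specialty data in the middle, general data lowest); no particular weighting scheme is required.

```python
# Hypothetical layer weights: context highest, specialty middle, general lowest.
LAYER_WEIGHTS = {"context": 0.5, "specialty": 0.3, "general": 0.2}


def layered_score(term, layers, weights):
    """Weighted sum of a term's score across the named sub-models."""
    return sum(weights[name] * layers[name].get(term, 0.0) for name in weights)


def best_term(candidates, layers, weights=LAYER_WEIGHTS):
    """Return the candidate with the highest layered score."""
    return max(candidates, key=lambda term: layered_score(term, layers, weights))


# Hypothetical per-layer scores for two similar-sounding terms.
layers = {
    "general": {"flemish": 0.6, "blemish": 0.4},
    "specialty": {"blemish": 0.9, "flemish": 0.1},
    "context": {"blemish": 0.8},
}
layered_choice = best_term(["flemish", "blemish"], layers)
general_only_choice = best_term(["flemish", "blemish"], layers, {"general": 1.0})
```

With all layers applied, the medically relevant term is selected; with only the general layer consulted, the sketch falls back to the term that is more common in general usage.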

Thus, using such an example system 100, speech-recognition that is targeted to a particular user and situation may be employed, so as to achieve more accurate speech-to-text conversion. The system 100 may be flexibly deployed, so that particular speech-to-text applications can dynamically gather the appropriate data automatically, without human intervention, and at run-time as it is needed. The system may also be employed so that recognition occurs on a server system and recognized text is passed to the device 104, particularly where the system 100 has high network capacity and server capability.

FIG. 2 is a schematic diagram of a system 200 for providing automated speech-recognition and medical transcription. The system 200 provides one example implementation for the features discussed above and below. In this example, in general, a computing device 202 communicates over a network 204 to obtain patient data from a medical records server 208 and speech recognition data from a speech server 206.

The network 204 may take a variety of forms, such as a LAN or WAN, and may also include portions of the internet. As such, various models for the system 200 may be employed. For example, a hosted “cloud-based” model may be used, by which a medical record provider may be accessed over the internet, and a speech recognition provider may also be accessed. In such a model, for example, the servers for the speech recognition provider may be provided with sound files or sound streams of spoken input by various users, along with meta data (e.g., ID or role of a user, name of a field in which the user is entering information, and the like), and may return text that a client device may simply insert into an application at the current cursor location. Such an implementation may also be fully hosted, so that, for example, a particular medical practice need only operate wireless computing devices that can access the internet (e.g., smartphones or 3G-equipped tablets or pads).
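As a hedged illustration of the hosted arrangement, a client might package digitized sound with the meta data mentioned above and insert the returned text at the cursor. The payload layout, field names, and sample values below are hypothetical and do not reflect the interface of any actual speech recognition provider.

```python
import base64
import json


def build_recognition_request(audio_bytes, user_id, role, field_name):
    """Serialize sound data plus meta data into a JSON request body."""
    payload = {
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "meta": {"user_id": user_id, "role": role, "field": field_name},
    }
    return json.dumps(payload)


def insert_at_cursor(existing_text, cursor, recognized_text):
    """Insert recognized text into existing text at the cursor position."""
    return existing_text[:cursor] + recognized_text + existing_text[cursor:]


# Hypothetical values; the two bytes stand in for a real sound file or stream.
request_body = build_recognition_request(
    b"\x00\x01", "lsmith", "cardiologist", "family_history"
)
note = insert_at_cursor("Patient reports .", 16, "chest pain")
```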

Thus, in this example, device 202 is a touch screen pad computer that is capable of interacting with a user graphically and through voice and gesture-based inputs. As shown, the device 202 is displaying a portrait 212, such as an image of a patient that is currently being treated, and a text box 210 into which a caregiver can enter text. In this example, the caregiver may prefer to speak the input and have it converted automatically to text that will be displayed into the text box 210. A number of components in the device 202 interoperate to permit speech-to-text conversion on the device 202. Exemplary components include a wireless interface, which allows the device 202 to be used mobilely, such as via a 3G or 4G wireless network or a WiFi network. A display controller 214 handles organization of items that are shown on the display of the device 202, which may be a touch screen display. For example, the display controller may receive input from various applications that are executing on the device 202, and may determine where information about each is to be displayed, and the layering of the displays of certain applications, where multiple applications are allowed to run simultaneously.

An input interface 216 coordinates various inputs that may be provided by a user to the device 202. For example, the input interface can receive information from a touch screen part of a display on the device 202, from an accelerometer or magnetometer to interpret motion-based gestures to the device, from a trackball, D Pad, or other pointing mechanism, or from a microphone. The input interface 216 may operate more broadly as an input method editor, that can convert various types of inputs to textual outputs, and may present an API for receiving such outputs. The input interface 216 may be subscribed to by various applications executing on the device, such as a medical records application 218, so that they may receive textual or directional input from the input interface 216, as appropriate. Either the input interface 216 may take part in speech-to-text conversion and may pass converted text to applications such as medical records application 218, or medical records application 218 may interact more directly with a user and may itself process the speech-to-text conversion (in cooperation with the server systems 206, 208).
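The subscription relationship between applications and the input interface 216 may be sketched as a simple publish-and-subscribe arrangement. The class and method names are hypothetical; an actual input method editor would expose its own API.

```python
class InputInterface:
    """Toy input interface that forwards converted text to subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        """Register an application callback to receive textual input."""
        self._subscribers.append(callback)

    def publish_text(self, text):
        """Deliver converted text to every subscribed application."""
        for callback in self._subscribers:
            callback(text)


# A list stands in for a medical records application receiving text.
received = []
interface = InputInterface()
interface.subscribe(received.append)
interface.publish_text("blood pressure stable")
```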

Medical records application 218 may take various forms, and in one feature, may permit healthcare providers, or caregivers, to record textual, natural language information about a patient or other topic. For example, a physician may record notes to an EMR for a patient after meeting with the patient as part of a check-up or follow-up visit, or after performing surgery on the patient. The medical records application may also integrate a number of other features such as patient tracking, patient triage, staffing and physician assignments, nursing assessments, medication ordering, imaging ordering, and prescription writing. Speech-to-text input may be particularly helpful to a user of medical records application 218 in areas such as hematology, oncology, geriatrics, orthopaedics, and other such specialties.

User data 220 is data stored on the device 202 either permanently, or downloaded to the device 202 whenever a particular user logs onto the device. The user data may represent preferences that the user may have for interacting with the device 202, such as favored programs to run on the device 202, colors and sounds for the UI of the device, and an identifier for the user, so that other customized information may be retrieved as the user interacts with various applications that are executing on the device 202. The user data 220 may also include data for permitting improved speech-to-text conversion for the device 202, from storing an identifier for the user that will be used to build a custom language model, to the language model itself.

The medical records application 218 may, at various points during its execution, cause information from the medical records server system 208 to be read or written. For example, when a caregiver records information about a particular patient, that information may be caused to be uploaded to the medical records server 208 and stored in EMR data database 232. Such information may then be accessible to other users employing other computing devices who may be interested in learning about the care that that patient has received to the present date.

Alternatively, the medical records application 218 may seek to access data about a patient, such as to see what kind of care has previously been given to the patient and other similar information. In such a situation, an access control system 230 that is part of the medical records server system 208 may restrict access to information about medical records so as to maintain appropriate levels of privacy for such records. The access control system 230 may use familiar techniques, including by requiring credentials from the device 202 and credentials from the user of the device 202 in order to access particular records. For example, a user of the device 202 may be required to provide a login ID and password in order to access information about patients, and the system 200 may maintain different levels of access restriction, so that some types of users can access certain types of data or may be able to access data at deeper levels.
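A minimal sketch of tiered access restriction follows. The user types, record types, and numeric levels are assumptions for illustration only and do not describe any particular access control product; the sketch requires both device and user credentials and then compares a user's level to a record's sensitivity.

```python
# Hypothetical access levels and record sensitivities (higher is more restricted).
ACCESS_LEVELS = {"clerk": 1, "nurse": 2, "physician": 3}
RECORD_SENSITIVITY = {"demographics": 1, "vitals": 2, "psychiatric_notes": 3}


def may_access(device_ok, user_ok, user_type, record_type):
    """Allow access only with valid device and user credentials and enough level."""
    if not (device_ok and user_ok):
        return False
    level = ACCESS_LEVELS.get(user_type, 0)
    # Unknown record types are treated as inaccessible.
    required = RECORD_SENSITIVITY.get(record_type, float("inf"))
    return level >= required


nurse_reads_vitals = may_access(True, True, "nurse", "vitals")
nurse_reads_psych = may_access(True, True, "nurse", "psychiatric_notes")
```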

Also, the medical records application 218 may request that information be passed from the medical records server system 208 to the speech server system 206, such as when information about a patient's medical record would be of assistance in generating a language model that is more customized with respect to an interaction of a caregiver with that particular patient (though the patient-specific data could also be transferred only to the device 202). Particular implementations of access control are well known, and various techniques may be used here.

The speech server system 206 may likewise include various components for converting speech received from the device 202 to text that may be passed through network 204 back to device 202. For example, an interface 224 may be responsible for receiving communications from the device 202 and packaging communications for transmission to the device 202. As one example, the interface 224 may check credentials from the device and may also perform initial processing on electronic sound information received from the device 202.

The speech server system 206 may also include user information 226. The user information 226 may be the same as, or similar to, the user data 220 on device 202. For example, where most of the speech-to-text processing occurs on the device 202, a limited amount of user information may be stored on the speech server system 206. As one example, the stored user information may include a user identifier for each such registered user, an identifier that indicates the role of the user so that appropriate speech model information may be accessed for that role (e.g., physician vs. nurse, or cardiologist vs. endocrinologist), and other relevant information. Where the speech server system 206 performs the bulk of the speech-to-text conversion, the user information 226 may also include portions of the speech model that are dedicated to each particular user and additional information needed to perform the complete speech to text conversion.

A recognizer 217 may be located on the device 202, on the speech server system 206, or on both. The recognizer 217 is responsible for converting received sound information to text using language models like those described here. The recognizer 217 may take a number of familiar forms and may use core processing techniques such as those used in the VIA VOICE and DRAGON NATURALLY SPEAKING commercial products. The recognizer 217 may, however, be programmed to construct a dynamically-built language model or grammar and to consult it in converting speech to text. As discussed above and below, for example, the dynamically-built model may, in providing information for speech-to-text conversion, take into account a role of a healthcare provider, and also take into account text from an electronic medical record for a particular patient, among other things.

In this manner, the system 200 may provide for flexible and accurate speech to text recognition. Where a hosted system is employed, and speech server system 206 takes on the bulk of speech to text conversion responsibilities, caregivers may be provided with relatively low cost yet user-friendly devices such as iPhones, Android-powered devices, and iPads. The speech-to-text conversion may occur on such devices or on the hosted system itself (which may be a publicly available system such as GOOGLE APPS, AMAZON S3, and the like), where appropriate security precautions are taken. As such, a flexible and accurate system may be implemented without a major capital expenditure by a provider of healthcare services. Also, a company may provide contract-based medical record services to a wide variety of caregivers, with the ability to have accurate speech-to-text conversion and little up-front cost.

FIG. 3 is a flow chart of an automated transcription process. In general, the process involves building a language model in preparation for receiving speech input from a healthcare professional. The particular language model in this example is made up of a variety of language sub-models which can be applied to convert speech input to text. The models in this example can be dynamically combined into a single model, or a recognizer may combine data from multiple sub-models at run-time to identify the most likely text that corresponds to a spoken input.

The process begins at box 302, where a speech recognition system is started. Such a step may occur, for example, when the user of a personal computer or other computing device turns on a speech-to-text conversion system on the device. In other instances, the system may be started automatically when the device itself is booted, or in other appropriate instances. At box 304, a general language model is loaded into the system. The general language model may represent usage of words and terms across the entirety of a language (English), or at least the entirety of a professional portion of the language (medicine), as opposed to slang and other terms, and may represent the frequency with which certain terms are normally used by the population. The general model may also be field specific, in that it may include medical-related terms, and may weight the medical-related terms more highly than nonmedical-related terms.

At box 306, the process identifies a user of the computing device and the user's role. The user may be identified in a variety of known manners, such as by a provision of credentials in the form of a login name and password, by a fingerprint reader, by other biometric sensors, or in other appropriate manners. The user role may be identified by consulting employment records in a computer system or simply by looking up a role in a table within the system. The role represents information about the user's professional position within an organization. In the examples discussed here, for example, the role may represent the general job title of the caregiver using the system (e.g., physician or nurse), and may also represent the specialty within which that caregiver works. A variety of specialties are shown, for example, in FIG. 1A.
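As one illustrative, non-limiting sketch, the table-based role lookup of box 306 might resemble the following. The table contents, the user identifiers, and the names `Role`, `ROLE_TABLE`, and `identify_role` are hypothetical and are assumed here only for purposes of illustration.

```python
from typing import NamedTuple, Optional


class Role(NamedTuple):
    """Professional position of a user (hypothetical structure)."""
    title: str                 # general job title, e.g., "physician" or "nurse"
    specialty: Optional[str]   # e.g., "oncology"; None when no specialty applies


# Hypothetical table that assigns user identifiers to roles.
ROLE_TABLE = {
    "u1001": Role("physician", "oncology"),
    "u1002": Role("nurse", None),
}


def identify_role(user_id: str) -> Optional[Role]:
    """Look up the role for an already-authenticated user identifier.

    Returns None when the identifier has not been associated with a role.
    """
    return ROLE_TABLE.get(user_id)
```

In such a sketch, the returned specialty may then be used to choose a specialty model in the manner discussed for box 308.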

At box 308, a specialty model is added to the overall speech recognition model. Various available specialty models may include terms, and scores for weighting those terms, that correspond to a specialty identified for the role of the user. For example, if the user is an oncologist, the specialty model may include terms relating to cancer and treatment of cancer. The specialty model may be added in a variety of manners for providing a dynamic language model, such as by blending dictionaries for a general model and a specialty model, or by consulting the general and specialty models separately at run-time, and by generating a blended score or ranking or weighting for each of the constituent components for the blended model.
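By way of example only, blending a general model and a specialty model may be sketched as a linear interpolation of per-word probabilities. The interpolation approach, the function name `blend_models`, and the default weight are assumptions made for illustration; other blending, ranking, or weighting techniques may be used.

```python
def blend_models(general, specialty, specialty_weight=0.3):
    """Blend two word-probability tables by linear interpolation.

    `general` and `specialty` map words to probabilities. A word that is
    missing from one table is treated as having probability 0.0 in it.
    """
    if not 0.0 <= specialty_weight <= 1.0:
        raise ValueError("specialty_weight must be between 0 and 1")
    vocabulary = set(general) | set(specialty)
    return {
        word: (1.0 - specialty_weight) * general.get(word, 0.0)
        + specialty_weight * specialty.get(word, 0.0)
        for word in vocabulary
    }
```

When each input table sums to one, the blended table also sums to one, so that the result may be consulted in the same manner as either constituent model.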

At box 310, the patient is identified and the patient's records are analyzed. Identification of the patient may, as discussed above, occur in a variety of manners, such as by scanning a wristband for the patient, by a caregiver entering the patient's name, room number, or identification number, by selecting the patient from a list of patients, or in another known appropriate manner. The patient's record may be analyzed in a variety of manners also. For example, the textual portions of a patient's electronic medical record that have been added by caregivers may be parsed out of the electronic medical record, and may be provided to a speech recognition engine. Alternatively, an index may be created by the patient record system, where the index is built so as to mask private information of the patient and thereby provide privacy and security as the need may dictate.

At box 312, a medical record model is added to the overall speech model. The medical record model may reflect the occurrence of words in the particular patient's EMR, as identified by the analysis in box 310, so that those words may be added to a speech recognizer dictionary and may also ultimately be weighted very highly in the overall model. In some situations, only the words that would not otherwise be weighted highly already may be added to the overall model, such as proper nouns (e.g., names of other physicians who have treated the patient) and names of particular maladies facing the patient.

Terms that are not in the patient record but that are associated with terms in the patient record may also be added or have higher weight. For example, all terms relating to a particular disease or diagnosis may be added, so that even if an earlier physician did not mention one of the words, they will still be weighted highly under the assumption that a later physician may use those words. Also, if the medical record indicates a particular diagnosis, terms for alternative similar diagnoses may be added, under the assumption that the physician currently treating the patient may reach that alternative diagnosis and speak about it.
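As a minimal sketch of the medical record model discussed for box 312, words may be counted in the text of a patient's record, commonly used words may be omitted, and associated terms may be added. The tokenization rule, the `RELATED_TERMS` table, and the name `build_record_model` are hypothetical and are shown only to illustrate one possible approach.

```python
import re
from collections import Counter

# Hypothetical association table: a term found in a record mapped to
# associated terms that a later caregiver may be expected to speak.
RELATED_TERMS = {"diabetes": ["insulin", "hyperglycemia"]}


def build_record_model(record_text, common_words, related=RELATED_TERMS):
    """Count record words that a general model would not already favor.

    Words in `common_words` are skipped. Terms associated with a counted
    word are given a count of at least one even when absent from the record.
    """
    tokens = re.findall(r"[a-z]+", record_text.lower())
    counts = Counter(t for t in tokens if t not in common_words)
    for term in list(counts):              # snapshot; counts is mutated below
        for associated in related.get(term, []):
            counts[associated] = max(counts[associated], 1)
    return counts
```

The resulting counts may then be converted to weights and layered with the other sub-models, with proper nouns and malady names from the record receiving relatively high weight.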

At box 314, each of the initial relevant language sub-models has been identified and made ready, and the process begins receiving speech input and converting it to text using the layered language model that is made up of the various sub-models. In this step, information that changes depending on the data entry context in which the user finds themselves is also obtained and can be used to further refine the language model at run-time as the user navigates data entry fields or screens in a medical application, in the various manners discussed above.

Though not discussed here, the identity of the user may also be used to access data relevant to the user's training of the system. The generation of such training data to improve speech-to-text conversion is well-known and is employed in commercial software packages such as DRAGON NATURALLY SPEAKING. Such user-specific data may also be added to the information that is used in this process in order to further improve the accuracy of transcriptions.

Also, certain voice input may be recognized as commands rather than input to be transcribed. Such activity is also well-known. For example, a user may navigate between fields on a form or pages within an application, by using spoken commands, and the process may distinguish such commands (e.g., using a command lexicon) from other spoken input. Other known speech-to-text techniques may also be combined with the process discussed here, as appropriate, to provide an improved user interaction.
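For purposes of illustration only, distinguishing spoken commands from input to be transcribed by using a command lexicon may be sketched as follows. The particular phrases, the action names, and the function `classify_utterance` are assumptions, and an actual system may match commands in other ways.

```python
# Hypothetical command lexicon: normalized phrase mapped to an action name.
COMMAND_LEXICON = {
    "next field": "NEXT_FIELD",
    "previous field": "PREVIOUS_FIELD",
    "next page": "NEXT_PAGE",
}


def classify_utterance(text):
    """Return ("command", action) for a lexicon match, else ("dictation", text)."""
    # Normalize case and collapse runs of whitespace before the lookup.
    normalized = " ".join(text.lower().split())
    action = COMMAND_LEXICON.get(normalized)
    if action is not None:
        return ("command", action)
    return ("dictation", text)
```

In this sketch, only an utterance that matches a lexicon entry in its entirety is treated as a command, so that dictated text that merely contains a command phrase is still transcribed.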

Thus, by this process, a user of a computing device may be provided with accurate transcription simply by logging in and identifying a patient who they are treating. The system may dynamically build a language model that takes into account particular aspects of the transcription session, such as a medical record of the patient and a specialty of the treating caregiver. Such information may be used to target the language model to the particular circumstances, and may change dynamically so that the transcription is as accurate as possible.

FIG. 4 is a swim lane diagram of a process showing actions in an automated transcription process that may be carried out by the cooperation of multiple sub-systems. In general, the process is similar to that in FIG. 3, but shows examples of particular components within a system that may carry out actions in the process. In this particular example, a client device used by a caregiver does little of the speech processing, and instead a central server system is passed sound data from the client device and returns transcribed text to the client device.

The process starts at box 402, where a user logs onto a client device and thereby to a speech server system (box 404). Logging in to the client may cause user settings on the client to change, such as by changing a visual theme of the device to one set by the user, and loading an application or set of applications that are commonly used by the user. Logging in to the server system may cause the server system to open a session and access user files for the user, including by looking up a role for the user and starting to build a language model using that role (which may be represented by the user's ID when the system has previously associated the particular user with a role).

At box 406, the client identifies a patient for whom the user wishes to enter voice-based information. Such identification may occur in various ways, such as by the client device scanning a bar code associated with the patient; the client device being assigned to a room of the patient; the client device determining its location, and the system associating that location with the particular patient; the client device taking a photo of the patient, and the system performing facial recognition on the image to identify a match; or in other appropriate manners.

At box 408, the client device submits credentials (e.g., ID and password) and a patient identifier (e.g., name, ID number, facial image, etc.) to the medical records server system. Such credentials may ensure that unauthorized users may not cause personally-identifiable information about a patient to be accessed in any manner. Upon determining that the credentials are legitimate, the medical records server system may identify the relevant records for the patient. For example, the system may identify the patient's EMR or other appropriate record from the most recent time frame (which may vary based on the specialty of the caregiver, where the time window is wider for an oncologist and narrower for an ENT). At box 412, the medical records server system generates an index from the patient records and sends the index to the speech server system. The generation may include simply sending the raw text from the record to the speech server system, perhaps with meta data such as the forms or fields in which each instance of text was located.
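As one hedged illustration of a specialty-dependent time frame, record entries may be filtered against a cutoff date that depends on the caregiver's specialty. The window lengths below are hypothetical placeholder values, not values taken from this document, and the names `WINDOW_DAYS` and `select_recent_entries` are likewise assumed.

```python
from datetime import date, timedelta

# Hypothetical window lengths in days: wider for oncology, narrower for ENT.
WINDOW_DAYS = {"oncology": 5 * 365, "ent": 90}
DEFAULT_WINDOW_DAYS = 365  # assumed fallback for unlisted specialties


def select_recent_entries(entries, specialty, today):
    """Keep the text of entries dated within the window for a specialty.

    `entries` is an iterable of (entry_date, text) pairs.
    """
    days = WINDOW_DAYS.get(specialty, DEFAULT_WINDOW_DAYS)
    cutoff = today - timedelta(days=days)
    return [text for entry_date, text in entries if entry_date >= cutoff]
```

The text selected in this manner may then serve as the input from which the index of box 412 is generated.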

Alternatively, the generating may be more involved and may include actions that restrict the information that the speech server could be able to determine about the patient. For example, certain terms could be genericized or replaced with synonyms, the order of the wording may be changed, and/or the text may be changed entirely to an index that may be sorted alphabetically or by frequency of occurrence, and need not even represent the text with full accuracy, as long as it captures terms that would not be common in the other parts of the language model, and thus would need to be boosted to ensure that notes regarding the patient are accurately transcribed.
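A minimal sketch of such a restricted index follows, assuming that removal of word order and of commonly used words is sufficient for the privacy needs at hand. The function name `build_record_index` and the tokenization rule are hypothetical, and an actual implementation may additionally genericize terms or substitute synonyms.

```python
import re
from collections import Counter


def build_record_index(record_text, common_words, by_frequency=False):
    """Reduce record text to a list of distinct uncommon terms.

    The original word order is discarded. Terms are sorted alphabetically,
    or by descending frequency of occurrence when `by_frequency` is True
    (ties are broken alphabetically).
    """
    tokens = re.findall(r"[a-z]+", record_text.lower())
    counts = Counter(t for t in tokens if t not in common_words)
    if by_frequency:
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        return [term for term, _ in ranked]
    return sorted(counts)
```

Because the index retains only which uncommon terms occur, and not the sentences in which they occurred, the speech server may boost those terms without receiving the narrative of the record.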

At box 414, the speech server system receives the EMR information and builds a base language model. Such a model may incorporate, for example, information about general and medical usage, information about usage in a specialty for the caregiver, and information from the EMR. The speech server may then be prepared to convert received voice information, and may wait to receive such information.

At box 416, the client receives voice input from the user or caregiver, and transmits the voice input and contextual meta data to the speech server system. The meta data may include information that indicates where, in an application, the user is currently entering information, such as a particular field, form, or page (or the meta data may include the particular data, such as speech model data, that is relevant to the context). At box 420, the speech server incorporates the meta data, such as by identifying words and terms that are relevant to the particular context, into the language model so as to build a complete dynamic language model. Such activity may occur by melding all of the sub-models, or by operating a recognizer that refers to each of the sub-models and gives them appropriate weight.
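The run-time alternative, in which a recognizer refers to each sub-model and gives it appropriate weight, may be sketched as shown below. The field names, the probabilities, the context weight, and the names `FIELD_TERMS`, `pick_best`, and `recognize_in_context` are hypothetical, and the candidate texts are assumed to have been produced by an upstream acoustic stage that is not shown.

```python
# Hypothetical context sub-models keyed by the data entry field in use.
FIELD_TERMS = {
    "medications": {"metformin": 0.5, "lisinopril": 0.5},
    "allergies": {"penicillin": 0.6, "latex": 0.4},
}
CONTEXT_WEIGHT = 0.5  # assumed relative weight for the context sub-model


def pick_best(candidates, sub_models):
    """Return the candidate with the highest weighted sum across sub-models.

    `sub_models` is a list of (weight, {text: probability}) pairs.
    """
    def score(text):
        return sum(weight * model.get(text, 0.0) for weight, model in sub_models)
    return max(candidates, key=score)


def recognize_in_context(candidates, base_models, field):
    """Add the sub-model for the active field, then pick the best candidate."""
    context_model = FIELD_TERMS.get(field, {})
    return pick_best(candidates, list(base_models) + [(CONTEXT_WEIGHT, context_model)])
```

In this sketch, a candidate that the base models score poorly may nevertheless be selected when the meta data indicates a field, such as a medications field, in which that candidate is likely.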

At box 422, the recognizer or other appropriate component converts the voice data to a textual representation using the various sub-model data, and then transmits the text back to the client device (box 424). Such translation and delivery mechanisms may occur according to a publicly available API, so that the server system may operate relatively independently of the particular application that is requesting the speech-to-text conversion, similar to the operation of the ANDROID operating system when using GOOGLE speech-to-text conversion.

The client device may then display the text in the record where the cursor was located when the caregiver spoke the input. Such interaction may employ programming techniques such as asynchronous JavaScript and XML (AJAX) so as to provide flexible, but relatively time-wise independent operation of the application system.

At boxes 428 and 430, a write interaction, rather than a read interaction, may occur with the medical records server system. In particular, upon receiving the converted text, the caregiver may approve of it and cause it to be submitted to the patient's records, whereby it may be uploaded to the medical records server system and stored in a familiar manner. The caregiver may then navigate to another part of the application, and may provide additional spoken input in the manners discussed here.

FIG. 5 is a block diagram of computing devices 500, 550 that may be used to implement the systems and methods described in this document, as either a client or as a server or plurality of servers. Computing device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Computing device 550 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, and other similar computing devices. Additionally computing device 500 or 550 can include Universal Serial Bus (USB) flash drives. The USB flash drives may store operating systems and other applications. The USB flash drives can include input/output components, such as a wireless transmitter or USB connector that may be inserted into a USB port of another computing device. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations described and/or claimed in this document.

Computing device 500 includes a processor 502, memory 504, a storage device 506, a high-speed interface 508 connecting to memory 504 and high-speed expansion ports 510, and a low speed interface 512 connecting to low speed bus 514 and storage device 506. Each of the components 502, 504, 506, 508, 510, and 512, are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 502 can process instructions for execution within the computing device 500, including instructions stored in the memory 504 or on the storage device 506 to display graphical information for a GUI on an external input/output device, such as display 516 coupled to high speed interface 508. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 500 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

The memory 504 stores information within the computing device 500. In one implementation, the memory 504 is a volatile memory unit or units. In another implementation, the memory 504 is a non-volatile memory unit or units. The memory 504 may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device 506 is capable of providing mass storage for the computing device 500. In one implementation, the storage device 506 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 504, the storage device 506, or memory on processor 502.

The high speed controller 508 manages bandwidth-intensive operations for the computing device 500, while the low speed controller 512 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only. In one implementation, the high-speed controller 508 is coupled to memory 504, display 516 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 510, which may accept various expansion cards (not shown). In the implementation, low-speed controller 512 is coupled to storage device 506 and low-speed expansion port 514. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet) may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

The computing device 500 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 520, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 524. In addition, it may be implemented in a personal computer such as a laptop computer 522. Alternatively, components from computing device 500 may be combined with other components in a mobile device (not shown), such as device 550. Each of such devices may contain one or more of computing device 500, 550, and an entire system may be made up of multiple computing devices 500, 550 communicating with each other.

Computing device 550 includes a processor 552, memory 564, an input/output device such as a display 554, a communication interface 566, and a transceiver 568, among other components. The device 550 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 550, 552, 564, 554, 566, and 568, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.

The processor 552 can execute instructions within the computing device 550, including instructions stored in the memory 564. The processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors. Additionally, the processor may be implemented using any of a number of architectures. For example, the processor 552 may be a CISC (Complex Instruction Set Computers) processor, a RISC (Reduced Instruction Set Computer) processor, or a MISC (Minimal Instruction Set Computer) processor. The processor may provide, for example, for coordination of the other components of the device 550, such as control of user interfaces, applications run by device 550, and wireless communication by device 550.

Processor 552 may communicate with a user through control interface 558 and display interface 556 coupled to a display 554. The display 554 may be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 556 may comprise appropriate circuitry for driving the display 554 to present graphical and other information to a user. The control interface 558 may receive commands from a user and convert them for submission to the processor 552. In addition, an external interface 562 may be provided in communication with processor 552, so as to enable near area communication of device 550 with other devices. External interface 562 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.

The memory 564 stores information within the computing device 550. The memory 564 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. Expansion memory 574 may also be provided and connected to device 550 through expansion interface 572, which may include, for example, a SIMM (Single In Line Memory Module) card interface. Such expansion memory 574 may provide extra storage space for device 550, or may also store applications or other information for device 550. Specifically, expansion memory 574 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, expansion memory 574 may be provided as a security module for device 550, and may be programmed with instructions that permit secure use of device 550. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.

The memory may include, for example, flash memory and/or NVRAM memory, as discussed below. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 564, expansion memory 574, or memory on processor 552 that may be received, for example, over transceiver 568 or external interface 562.

Device 550 may communicate wirelessly through communication interface 566, which may include digital signal processing circuitry where necessary. Communication interface 566 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 568. In addition, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 570 may provide additional navigation- and location-related wireless data to device 550, which may be used as appropriate by applications running on device 550.

Device 550 may also communicate audibly using audio codec 560, which may receive spoken information from a user and convert it to usable digital information. Audio codec 560 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 550. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on device 550.

The computing device 550 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 580. It may also be implemented as part of a smartphone 582, personal digital assistant, or other similar mobile device.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), peer-to-peer networks (having ad-hoc or static members), grid computing infrastructures, and the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Although a few implementations have been described in detail above, other modifications are possible. Moreover, other mechanisms for performing the systems and methods described in this document may be used. In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims. 

What is claimed is:
 1. A computer-implemented method for transcribing spoken input into text, the method comprising: identifying a role that indicates a type of medicine performed by a particular healthcare provider, the role being identified automatically by identifying an electronic record that assigns an identifier for the healthcare provider to the role; using the identified role for the particular healthcare provider to select, with a computerized speech recognition system, a particular statistical language model from among a group of available statistical language models that (a) are each directed to a different particular medical specialty from others of the language models, and (b) each provide probabilities of word distributions for converting audio spoken input to textual output, which probabilities differ from those in others of the statistical language models, and further wherein the particular statistical language model is selected based on identifying a match between the role of the healthcare provider and a particular medical specialty to which the particular statistical language model is directed; receiving spoken input from the healthcare provider; and passing the spoken input through the selected electronic language model to produce textual output that is the spoken input in text form.
 2. The method of claim 1, wherein: identifying the role of the healthcare provider comprises receiving, by the computer system, a login from the healthcare provider, identifying the healthcare provider from the login, and associating a role with the identified healthcare provider, and using the identified role to select the particular language model comprises using the computer system to match the identified role with a specialty-specific language model for the role.
 3. The method of claim 1, wherein using the identified role to select the particular language model for the speech recognition system comprises selecting, from among a plurality of specialty-based language models, a particular specialty-based language model for a medical specialty for the role of the healthcare provider, by matching an identifier for the healthcare provider to a specialty for the healthcare provider.
 4. The method of claim 3, further comprising combining the particular language model with a general language model to create a language model used to produce the textual output.
 5. The method of claim 1, further comprising providing, with the textual output, an identifier for a healthcare patient to which the healthcare provider's spoken input was directed.
 6. The method of claim 1, further comprising identifying a patient for which the healthcare provider is providing spoken input, and generating a language model using language model data directed to information for the identified patient.
 7. The method of claim 6, wherein generating the language model data comprises extracting words from prior medical record entries for the patient and using the extracted words for generating the language model.
 8. The method of claim 1, further comprising using training data for the voice of the healthcare provider to provide the particular language model.
 9. The method of claim 1, further comprising identifying information in a category for a data input field in a data entry form that the healthcare provider is providing spoken input into, and modifying the language model used to produce the textual output in response to identifying the information for the category for the data input field.
 10. The method of claim 9, wherein the category is determined from a field label for the field in software code that causes the form to be displayed.
 11. The computer-implemented method of claim 1, further comprising creating a composite language model by layering the selected particular language model with a general language model that represents general human speech and a language model generated using text previously entered into an electronic medical record for a patient being treated by the healthcare provider, and weighting each of the language models in the composite language model.
 12. A computer-implemented system for transcribing spoken input into text, the system comprising: a voice input interface to receive spoken input from a healthcare provider; a language model builder in a computer system operable to: (a) identify a workplace role of the healthcare provider and to select, from a plurality of different language models that are classified in categories assigned to user roles for groups of users and that each (a) are directed to a particular medical specialty different than others of the language models and (b) provide probabilities of word distributions for converting audio spoken input to textual output, which probabilities differ from those in others of the statistical language models, and further wherein a particular language model is selected using the identified workplace role, and (b) combine the particular language model with a general language model that represents general human speech, while providing relative weights to each of the particular language model and general language model, and wherein the particular language models and the general language model each correlate sound inputs to textual outputs and provide for speech-to-text conversion; a speech-to-text converter programmed to convert the received spoken input into textual output by applying the language model selected by the language model builder to the received spoken input.
 13. The system of claim 12, wherein the language model builder is programmed to combine the language model directed to the particular medical specialty with a general language model that is not directed to a particular medical specialty, to create a language model that is used to convert speech to text.
 14. The system of claim 12, wherein the speech-to-text converter is further programmed to provide the textual output to an electronic medical record system.
 15. The system of claim 14, wherein the system is further programmed to provide, with the textual output, an identifier for a healthcare patient to which the healthcare provider's spoken input was directed.
 16. The system of claim 12, wherein the language model builder is programmed to identify a patient for which the healthcare provider is providing spoken input, and generate language model data directed to information for the patient, the language model data being provided in a language model that is used to convert spoken input to text.
 17. The system of claim 16, wherein generating the language model data comprises extracting words from prior medical record entries for the patient and using the extracted words for generating the language model.
 18. The system of claim 12, wherein the language model builder is further programmed to use training data for the voice of the healthcare provider to produce the language model.
 19. The system of claim 12, wherein the speech-to-text converter is further programmed to identify an information category for a data input field in a data entry form into which the healthcare provider is providing spoken input, and to modify the language model in response to identifying the information for the category.
 20. The system of claim 19, wherein the category correlates to a field, identified from a textual field label in software code that causes the form to be displayed, into which the healthcare provider is providing spoken input.
 21. The system of claim 12, wherein the language model builder is programmed to use the identified role to select the language model for the speech recognition system by selecting a specialty-based language model for a medical specialty for the role of the healthcare provider.
 22. A computer-implemented method for transcribing spoken input into text, the method comprising: identifying a role that indicates a type of medicine performed by a particular healthcare provider based on the particular healthcare provider's working position in a healthcare organization; obtaining information previously entered into an electronic medical record for a patient who is to be treated by the healthcare provider; training a first speech-to-text language model using the information previously entered into the electronic medical record for the patient; generating a combined speech-to-text language model that combines (a) a general speech-to-text language model for general speech, (b) a practice-specific speech-to-text language model directed to a medical specialty for the role that represents the healthcare provider's working position, and (c) the first speech-to-text language model generated from the information previously entered into the electronic medical record for the patient, wherein the combined speech-to-text language model provides for speech-to-text conversion using probabilities of word distributions for converting audio spoken input to textual output; receiving spoken input from the healthcare provider; and producing textual output for the spoken input by passing audio of the spoken input through the combined language model.
 23. The computer-implemented method of claim 22, further comprising: identifying a field to which the healthcare provider is currently providing input; and adding, to the combined language model, data for a language model particular to the identified field.
 24. The computer-implemented method of claim 23, wherein the spoken input is received at a central server system separate from a device on which the spoken input was provided, and the textual output is provided over a network to the device on which the spoken input was provided. 
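
By way of non-limiting illustration only, the weighted layering of a general language model, a specialty language model, and a patient-specific language model recited in claims 11, 12, and 22 may be sketched in Python as a linear interpolation of word-probability tables. The function names train_unigram and combine_models, the sample texts, and the weights are hypothetical and do not appear in the disclosure. Unigram tables are a simplifying assumption, and acoustic modeling is omitted.

```python
from collections import Counter


def train_unigram(text):
    """Estimate word probabilities from raw text by relative frequency.

    Hypothetical helper: a stand-in for training a language model from
    text such as prior electronic medical record entries.
    """
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def combine_models(models, weights):
    """Linearly interpolate several word-probability tables.

    Weights are normalized to sum to 1, so the result is again a
    probability distribution over the union of the vocabularies.
    """
    if len(models) != len(weights):
        raise ValueError("one weight is required per model")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total_weight for w in weights]
    vocabulary = set().union(*models)
    return {
        word: sum(w * m.get(word, 0.0) for w, m in zip(normalized, models))
        for word in vocabulary
    }


# Illustrative (assumed) training texts for the three layers.
general = train_unigram("he did a short stint at the clinic")
specialty = train_unigram("stent placed in the artery stent patent")
patient = train_unigram("prior stent in left artery")

# Assumed relative weights for the general, specialty, and patient layers.
combined = combine_models([general, specialty, patient], [0.5, 0.3, 0.2])

# Choose between two acoustically similar candidate words.
best = max(["stent", "stint"], key=lambda word: combined.get(word, 0.0))
```

In this sketch, the specialty and patient layers raise the probability of "stent" relative to "stint", so the combined model favors the clinical term even though the general layer alone would not.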